『Reasoning with Language Model Prompting: A Survey』

https://gyazo.com/763b023ae2b87594796e561efcbb86b5

https://arxiv.org/pdf/2212.09597.pdf

research works with comparisons and summaries and provide systematic resources to help beginners. We also discuss the potential reasons for the emergence of such reasoning abilities and highlight future research directions.

1 Introduction

Reasoning ability lies at the heart of human intelligence, yet in natural language processing (NLP), modern neural networks can hardly reason from what they are told or already know (Duan et al., 2020; Wang et al., 2021; Bhargava and Ng, 2022). Fortunately, with the revolutionary development of pre-training (Brown et al., 2020; Chen et al., 2021; Chowdhery et al., 2022), scaling up the size of language models (LMs) has been shown to confer a range of reasoning abilities, such as arithmetic (Wang et al., 2022g; Lewkowycz et al., 2022), commonsense (Jung et al., 2022; Liu et al., 2022b) and symbolic (Zhou et al., 2022a; Khot et al., 2022) reasoning. As shown in Figure 1, such abilities may be unlocked by prompting strategies (Liu et al., 2022d) (e.g., chain-of-thought (CoT) prompting (Wei et al., 2022b), generated knowledge prompting (Liu et al.,

https://gyazo.com/4cca4c34d9c316ea0baa7f35b248b99c

Figure 1: Reasoning with language model prompting. In-context exemplars, knowledge (both color-coded in the figure), or just "Let’s think step by step!" are used as prompts to enhance language model reasoning.

Organization of This Survey: In this paper, we conduct the first survey of recent progress in reasoning with language model prompting. We first give some preliminaries on this direction (§2) and then propose to organize relevant work by taxonomy (§3). We further provide in-depth comparisons with discussion for insights (§4). To help beginners who are interested in this field, we highlight some open resources (§5) and potential future directions (§6).

2 Preliminaries

https://gyazo.com/763b023ae2b87594796e561efcbb86b5

In this section, we introduce preliminaries of reasoning with LM prompting. For standard prompting, given the reasoning question Q, prompt T and parameterized probabilistic model p_LM, we aim to maximize the likelihood of answer A as:

https://gyazo.com/dc4057ded025ac8812250e78e702a929

where a_i is the i-th token of A, and |A| denotes the length of A. For few-shot prompting, T comprises K exemplars of (Q, A) pairs.

To enhance the reasoning ability with LM prompting, there are two major branches of research. The first focuses on optimizing the reasoning strategy with prompting (§3.1), as shown in Figure 2, including prompt engineering (§3.1.1) and process optimization (§3.1.2).

For prompt engineering (§3.1.1), many approaches try to improve the quality of prompt T, and we call those works single-stage methods, while others append c_i into the context of (T, Q) at each reasoning stage or design a specific T_{c_i} for each c_i, and we regard those as multi-stage methods. For example, Wei et al. (2022b) add reasoning steps C into the prompt, where T = {(Q_i, C_i, A_i)}_{i=1}^{K}, thus Equation 1 can be reformulated as:

https://gyazo.com/eb152a388798ab941d6bd682c0dd921d

where c_i is one step of the total |C| reasoning steps. For process optimization (§3.1.2), the simplest way is to bring in an optimizer with parameters θ to calibrate C when generating A, and we call those works self-optimization methods. Some other methods obtain multiple processes and assemble the final answer from them. We regard those works as ensemble-optimization methods. Moreover, the overall optimization process can be iteratively integrated with fine-tuning p_LM on generated triplets (Q, C, A), which are regarded as iterative-optimization methods. Besides, some works leverage external reasoning engines (§3.1.3) to produce T or directly execute C for reasoning.

The second branch focuses on knowledge enhancement with prompting (§3.2). Note that rich implicit “modeledge” (Han et al., 2021) in LMs can generate knowledge or rationales as knowledge-informed prompt T (§3.2.1). Meanwhile, explicit knowledge in external resources can also be leveraged and retrieved as knowledgeable prompts to enhance reasoning (§3.2.2).
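The two equations in this section appear only as linked images, so it may help to write them out. The block below is a reconstruction from the surrounding definitions (Q, T, A, C, a_i, c_i, p_LM), assuming the usual autoregressive factorization over tokens and reasoning steps. The index j and the shorthand a_{<i}, c_{<i} for "all earlier tokens or steps" are notation introduced here, and the images may differ in layout.

```latex
% Equation 1: standard prompting
p(A \mid T, Q) = \prod_{i=1}^{|A|} p_{\mathrm{LM}}\left(a_i \mid T, Q, a_{<i}\right)

% Equation 2: prompting with reasoning steps C, where T = \{(Q_i, C_i, A_i)\}_{i=1}^{K}
p(A \mid T, Q) = p(A \mid T, Q, C)\, p(C \mid T, Q)

p(C \mid T, Q) = \prod_{i=1}^{|C|} p_{\mathrm{LM}}\left(c_i \mid T, Q, c_{<i}\right),
\qquad
p(A \mid T, Q, C) = \prod_{j=1}^{|A|} p_{\mathrm{LM}}\left(a_j \mid T, Q, C, a_{<j}\right)
```

Read this way, the only change from Equation 1 to Equation 2 is that the answer tokens are conditioned on a generated reasoning chain C, which is itself produced step by step by the same model.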

3 Taxonomy of Methods

In this paper, we survey existing methods for reasoning with LM prompting, categorizing them into Strategy Enhanced Reasoning (§3.1) and Knowledge Enhanced Reasoning (§3.2). As shown in Figure 2, we further refine them according to the distinctive features of different methods.

3.1 Strategy Enhanced Reasoning

The primary purpose of this line of work is to design a better reasoning strategy to enhance the performance of LM reasoning, concretely embodied in prompt engineering (§3.1.1), process optimization (§3.1.2) and external engines (§3.1.3).

3.1.1 Prompt Engineering

Single-Stage.

Early works leverage template-based prompts (Paranjape et al., 2021; Rajagopal et al., 2021) for reasoning in NLP. Building on the strong in-context learning ability of large language models (Brown et al., 2020), Wei et al. (2022b) propose CoT prompting, which adds a series of intermediate reasoning steps, also called CoT, into the exemplars of a few-shot prompt to induce large language models to generate a reasoning process before answering. Experiments demonstrate that large language models show impressive reasoning abilities with CoT prompting.

In spite of the large improvement brought by CoT, in-context learning is highly sensitive to the selection of exemplars, and even a tiny change may cause a large drop in model performance (Lu et al., 2022c; Min et al., 2022; Webson and Pavlick, 2022). Hence, the quality of exemplars appears to be particularly important. Fu et al. (2022) indicate that prompts with higher reasoning complexity, e.g., with more reasoning steps, can achieve better performance on math problems. Zhang et al. (2022b) explore the impact of the diversity of exemplars in the prompt: through clustering, they obtain a representative question set as a prompt. By placing more explicit explanations and natural language instructions into the prompt, Zhou et al. (2022b) relieve the ambiguity for LMs when facing out-of-distribution (OOD) algorithmic problems. The above work shows that LMs can be outstanding few-shot reasoners. Surprisingly, Kojima et al. (2022) indicate that LMs are also zero-shot reasoners without needing extra exemplars. By only appending "Let’s think step by step", LMs can consciously generate reasoning steps.
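To make the difference between few-shot CoT and zero-shot CoT concrete, here is a minimal Python sketch of how the two prompts might be assembled as plain strings. The `Exemplar` class, the function names and the "Q:/A:" layout are illustrative assumptions, not the exact format used by Wei et al. (2022b) or Kojima et al. (2022).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One in-context example: question, reasoning steps (CoT), final answer."""
    question: str
    rationale: str
    answer: str


def few_shot_cot_prompt(exemplars: List[Exemplar], question: str) -> str:
    """Build a few-shot prompt whose exemplars contain intermediate reasoning."""
    blocks = [
        f"Q: {e.question}\nA: {e.rationale} The answer is {e.answer}."
        for e in exemplars
    ]
    # The new question is left open so the model continues with its own reasoning.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def zero_shot_cot_prompt(question: str) -> str:
    """Build a zero-shot prompt that only appends the trigger sentence."""
    return f"Q: {question}\nA: Let's think step by step."


demo = [Exemplar("Tom has 3 apples and buys 2 more. How many apples does he have?",
                 "Tom starts with 3 apples. 3 + 2 = 5.", "5")]
few_shot = few_shot_cot_prompt(demo, "Ann has 4 pens and loses 1. How many pens are left?")
zero_shot = zero_shot_cot_prompt("Ann has 4 pens and loses 1. How many pens are left?")
```

The single-stage methods described above all operate on this one prompt string: they change which exemplars go in, how complex their rationales are, or which instruction is appended.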

Multi-Stage.

When humans reason, it is usually challenging to come up with the whole reasoning process in one stroke. A more intuitive solution is to decompose a complex problem into simpler sub-problems and reason stage by stage. Similarly, this series of works aims to transform previous one-stage prompting into multi-stage prompting. Press et al. (2022) explicitly define follow-up questions and intermediate answers in prompts to narrow the compositionality gap in LMs. Jung et al. (2022) regard the output of each stage as a separate new question, while Zhou et al. (2022a) and Wang et al. (2022b) append it to the whole context to prompt LMs. Khot et al. (2022) first decompose a task into split and merge sub-tasks and then construct specific prompts to tackle each sub-task. Creswell and Shanahan (2022) follow the Selection-Inference structure (Creswell et al., 2022), which selects specific contexts and makes inferences based on them at each stage.
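The "append each stage's output to the whole context" variant can be sketched as a short loop. This is only an illustration under stated assumptions: `lm` is a hypothetical callable that maps a prompt string to a completion string, and the list of sub-questions is assumed to be given rather than produced by a decomposition prompt.

```python
from typing import Callable, List


def multi_stage_answer(context: str,
                       sub_questions: List[str],
                       lm: Callable[[str], str]) -> str:
    """Answer sub-questions in order, feeding every earlier answer back in."""
    transcript = context
    answer = ""
    for sub_question in sub_questions:
        prompt = f"{transcript}\nQ: {sub_question}\nA:"
        answer = lm(prompt).strip()
        # The answered sub-question becomes part of the context of the next stage.
        transcript = f"{prompt} {answer}"
    # The answer to the last sub-question is treated as the final answer.
    return answer
```

Treating each stage's output as a separate new question, as described for Jung et al. (2022), would instead start every stage from a fresh prompt rather than from the growing `transcript`.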

3.1.2 Process Optimization

Natural language rationales (Ling et al., 2017a), also called reasoning processes in CoT, play a vital role in CoT prompting (Ye and Durrett, 2022; Lampinen et al., 2022; Min et al., 2022). The consistency of the reasoning process (Wang et al., 2022g), as well as the continuity between reasoning steps (Li et al., 2022e), should both affect the accuracy of final answers. As shown in Figure 4, we introduce this line of methods in three types, i.e., self, ensemble, and iterative process optimization.

Self-Optimization.

Here, self-optimization refers to correcting a single process by injecting extra modules. To mitigate the influence of unreliable rationales, Ye and Durrett (2022) use a calibrator to tune the probabilities of a prediction based on a score that reflects the factuality of a rationale. During free-text rationale generation, Wiegreffe et al. (2022) fine-tune a sequence-to-sequence model as a filter to predict whether the explanation is acceptable.

Ensemble-Optimization.

Due to the limitation of relying on only one reasoning path, the following works rely on ensemble calibration among multiple processes. Wang et al. (2022g) introduce sampling strategies (Ackley et al., 1985; Fan et al., 2018) commonly used in natural language generation to obtain multiple reasoning processes and select the most consistent answer by majority vote. Motivated by the observation that when a reasoning process reaches a wrong answer, not all of its steps are responsible for the final error, Li et al. (2022e) propose a step-aware voting verifier to score each reasoning path. When misguided majority processes overwhelm reasonable minority processes, the step-aware voting verifier can alleviate the limitation of vanilla majority vote (Wang et al., 2022g). Besides, Wang et al. (2022f) empirically observe that decoder sampling in the output space is the key to robustly improving performance because of the brittleness of manual prompt engineering.
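The vanilla majority vote over sampled reasoning paths can be sketched in a few lines of Python. The answer-extraction pattern ("The answer is N") and the function names are assumptions made for this illustration; the sampled paths would come from an LM decoded with sampling, which is not shown.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_answer(reasoning: str) -> Optional[str]:
    """Pull the final numeric answer out of one sampled reasoning path."""
    match = re.search(r"The answer is\s*(-?\d+(?:\.\d+)?)", reasoning)
    return match.group(1) if match else None


def majority_vote(reasoning_paths: List[str]) -> Optional[str]:
    """Return the answer that the largest number of sampled paths agree on."""
    answers = [a for a in map(extract_answer, reasoning_paths) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


paths = [
    "16 - 3 - 4 = 9 eggs are sold. 9 * 2 = 18. The answer is 18.",
    "16 - 3 = 13. 13 * 2 = 26. The answer is 26.",
    "She sells 9 eggs at 2 dollars each. The answer is 18.",
]
voted = majority_vote(paths)
```

A step-aware verifier in the spirit of Li et al. (2022e) would replace the plain count with a learned score per path, so that a well-reasoned minority answer is not simply outvoted.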

Iterative-Optimization.

While LMs can achieve excellent performance in few-shot (Wei et al., 2022b) or zero-shot (Kojima et al., 2022) manners with prompts, another paradigm is to calibrate reasoning processes iteratively with LM fine-tuning. Specifically, iterative-optimization-based methods repeat the process of prompting LMs to generate reasoning processes and using the instances with generated reasoning processes to fine-tune themselves. Zelikman et al. (2022) start with a small set of exemplars to push LMs to produce reasoning steps and answers themselves. Questions and reasoning steps with correct answers are directly added to the dataset for fine-tuning. Incorrect ones are fed into the model again, tagged with a hint that gives the correct answer. Compared with Zelikman et al. (2022), Huang et al. (2022) do not need gold labels during self-teaching. Following Wang et al. (2022g), they generate multiple reasoning processes and keep the ones that lead to the most consistent answer. They then fine-tune the model on these self-generated reasoning-answer data. Wang et al. (2022a) propose an iterative context-aware prompter which learns to dynamically synthesize prompts conditioned on the contexts of the current step.
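One round of the data-collection step described for Zelikman et al. (2022) could look roughly like the sketch below. It is a simplification under stated assumptions: `generate` is a hypothetical callable taking a question and an optional hint and returning a `(rationale, answer)` pair, the hint wording is invented here, and the fine-tuning step itself is left out.

```python
from typing import Callable, List, Tuple

# (question, hint) -> (rationale, answer); stands in for prompting the LM.
Generator = Callable[[str, str], Tuple[str, str]]


def collect_finetuning_data(dataset: List[Tuple[str, str]],
                            generate: Generator) -> List[Tuple[str, str, str]]:
    """Keep (Q, C, A) triplets whose generated answer matches the gold label."""
    kept = []
    for question, gold in dataset:
        rationale, answer = generate(question, "")
        if answer != gold:
            # Retry with the correct answer given as a hint, so the model
            # can produce a rationale that leads to it.
            rationale, answer = generate(question, f"(Hint: the answer is {gold})")
        if answer == gold:
            kept.append((question, rationale, gold))
    return kept
```

The returned triplets would then be used to fine-tune the model, and the whole round repeated. The label-free variant described for Huang et al. (2022) would replace the comparison with `gold` by a consistency check across several sampled answers.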

3.1.3 External Engine

When reasoning with LM prompting, the models should be capable of both semantic understanding (e.g., of questions) and complex reasoning (e.g., by generating reasoning processes); however, we cannot have both fish and bear’s paw (Hendrycks et al., 2021; Nogueira et al., 2021; Lewkowycz et al., 2022). To remove this obstacle, external reasoning engines lend a helping hand to LMs (see Figure 5).

Physical Simulator.

Given a physical reasoning question, Liu et al. (2022e) use a computational physics engine (Todorov et al., 2012) to simulate the physical process. The simulation results are treated as prompts to help LMs reason, making up for the lack of physical knowledge in LMs.

Code Interpreter.

With the emergence of LMs of code (Chen et al., 2021; Xu et al., 2022), combining LMs and code to tackle specific tasks has recently sprung up (Wang et al., 2022e; Cheng et al., 2022; Wu et al., 2022b). Note that programs offer advantages in robustness and interpretability and can better illustrate complex structures and carry out complex calculations. Madaan et al. (2022) reframe structured commonsense reasoning tasks as code generation tasks, replacing natural language with Python class code to represent structured graphs both in few-shot prompts and in LM outputs. Gao et al. (2022) offload the solving step from the LM to a programmatic runtime, so that the LM is mainly left with the learning task. In few-shot prompts and LM outputs, the reasoning processes are replaced by a mixture of natural and programming language, where natural language is treated as annotations to aid the generation of the program. Similar to Gao et al. (2022), Chen et al. (2022b) propose program of thoughts (PoT) prompting, which disentangles computation from reasoning. The main difference is that it also puts forward a zero-shot format of PoT prompting.

https://gyazo.com/ed4afb0a666345e17d99b2fd5ecadf49

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

https://arxiv.org/pdf/2211.12588.pdf

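The Program of Thoughts paper. The method generates code written in a programming language and then produces the final answer from it. It reportedly improves accuracy by 12% over step-by-step prompting. A zero-shot (ZS) variant is also introduced.

This seems likely to improve performance on topics that involve arithmetic operations. As for recent talk, perhaps Yoichi Ochiai's modeling of the Heart Sutra is an example?

If the point is that programmed code is more interpretable and robust than natural language, and that this improves the quality of reasoning, then perhaps the gap would shrink if natural-language instructions were also made highly interpretable and robust. People who are good at putting things into words have the advantage.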
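The idea of handing the computation to an interpreter can be sketched as follows. This is an illustration only: `generated_program` is a hand-written stand-in for what an LM might emit, and the convention of storing the result in a variable named `ans` is an assumption made for this sketch.

```python
def run_program_of_thought(program: str, answer_var: str = "ans"):
    """Execute a generated program and read the answer from a named variable."""
    namespace = {}
    # Emptying __builtins__ only limits what the snippet can call by accident;
    # it is NOT a real sandbox, so untrusted model output needs proper isolation.
    exec(program, {"__builtins__": {}}, namespace)
    return namespace[answer_var]


# Stand-in for model output: comments act as natural-language annotations,
# while the arithmetic itself is left to the interpreter.
generated_program = """
# eggs laid per day
eggs_per_day = 16
# eggs eaten at breakfast and used for baking
eaten = 3
baked = 4
# price per remaining egg in dollars
price = 2
ans = (eggs_per_day - eaten - baked) * price
"""

result = run_program_of_thought(generated_program)
```

Here the LM only has to get the program right; the multiplication and subtraction are carried out exactly by the runtime, which is where the claimed robustness for numerical tasks comes from.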

3.2 Knowledge Enhanced Reasoning

Knowledge is the cornerstone of reasoning. Knowledge enhanced methods aim to prompt LMs with implicit (§3.2.1) or explicit (§3.2.2) knowledge to assist in reasoning (see Figure 6).

3.2.1 Implicit Knowledge

Researchers have shown that LMs contain considerable implicit knowledge which can be elicited via conditional generation (Davison et al., 2019; Petroni et al., 2019; Jiang et al., 2020). The following works try to induce such “modeledge” as knowledge-informed prompts for reasoning.

Liu et al. (2022c) apply GPT-3 (Brown et al., 2020) with few-shot prompting to generate knowledge and prompt the downstream LM with it. Based on this, Liu et al. (2022b) draw on reinforcement learning (Schulman et al., 2017) to further calibrate the knowledge. Different from the above, which only use few-shot prompting in the knowledge generation stage, Sun et al. (2022) propose a two-stage generative prompting which additionally includes answer generation prompts. Li et al. (2022c) and Wang et al. (2022c) both follow the paradigm of generating explanations by prompting a larger LM and then fine-tuning a smaller LM on them. They mainly use the strong generation ability of LMs with few-shot prompting.

3.2.2 Explicit Knowledge

Although large LMs have shown strong generation ability (Wiegreffe et al., 2022; Li et al., 2022c; Wang et al., 2022c), they still tend to hallucinate facts (Rohrbach et al., 2018) and generate inconsistent knowledge (Liu et al., 2022b). Recent works show that retrieving prompts for in-context learning is an effective means of achieving good performance (Liu et al., 2022a; Rubin et al., 2022).

Because the method of Liu et al. (2022a) is unstable when measuring the similarity of structured information, Lu et al. (2022b) propose a dynamic prompt retrieval method based on a policy gradient strategy, without brute-force searching. Su et al. (2022) formulate a selective annotation framework to avoid the need for a large labeled retrieval corpus. It develops a graph-based method to construct a small labeled database, as diverse and representative as possible, from a large unlabeled corpus. Then the in-context labeled examples can be retrieved from the small database, which largely reduces the cost of annotation and retrieval.
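Retrieving in-context examples by similarity can be illustrated with a deliberately simple sketch. It swaps in bag-of-words cosine similarity for the learned sentence encoders that the cited retrieval methods rely on, so it shows the shape of the pipeline rather than any particular paper's method. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve_exemplars(query: str,
                       pool: List[Tuple[str, str]],
                       k: int = 2) -> List[Tuple[str, str]]:
    """Return the k labeled (question, answer) pairs most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(pool,
                    key=lambda pair: cosine(query_vec, bag_of_words(pair[0])),
                    reverse=True)
    return ranked[:k]


pool = [
    ("How many apples are left", "3"),
    ("What is the capital of France", "Paris"),
    ("How many pears are left", "5"),
]
top = retrieve_exemplars("How many apples are left after lunch", pool, k=2)
```

The retrieved pairs would then be placed in the prompt as few-shot exemplars. In the selective-annotation setting described above, `pool` would be the small labeled database built from the large unlabeled corpus.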

4 Comparison and Discussion

4.1 Comparison of Language Models

Table 1 shows four comparison scopes of different methods. We further illustrate the performance comparison of LMs with different scales on GSM8K (Cobbe et al., 2021), an arithmetic reasoning benchmark, in Figure 7. Similar results on commonsense reasoning benchmarks are shown in Appendix B.

Wei et al. (2022b) systematically demonstrate that few-shot prompting performs better in almost all tasks as model scale increases, which can be explained by the fact that LMs with larger model size contain more implicit knowledge for reasoning (Liang et al., 2022b). Moreover, CoT prompting produces much greater increases, with PaLM-540B showing the greatest improvements, as depicted in Figures 7 and 8. However, when the model scale falls below 100B, CoT prompting yields no performance gain and may even be detrimental. Thus, CoT prompting elicits an emergent ability of model scale, defined as an ability of pre-trained LMs that is not present in smaller-scale models but is present in large-scale models (Wei et al., 2022a). Another intriguing observation depicted in Figures 7 and 8 is that PaLM-62B (Chowdhery et al., 2022) even performs better than LaMDA-137B (Thoppilan et al., 2022), possibly because it was trained on a higher-quality corpus.

Notably, Figures 7 and 8 also illustrate that, at the same parameter scale, Codex (Chen et al., 2021) outperforms GPT-3 significantly, even though the major difference between them is the training corpus (Codex is a GPT-3 variant trained on code). This phenomenon can also be observed in recent works (Zhou et al., 2022a; Li et al., 2022e; Fu et al., 2022; Zhang et al., 2022b; Madaan et al., 2022; Liang et al., 2022b), indicating that pre-training on code not only enables the ability of code generation/understanding but may also trigger the reasoning ability with CoT. The exact cause is still elusive, but one theory is that code is a more reasonable form of text: thinking about procedure-oriented programming is analogous to solving problems step by step, and object-oriented programming is analogous to decomposing complex tasks into simpler ones.

4.2 Comparison of Prompts

https://gyazo.com/9cc7b4ecb51557500a2884562691f77e

Table 1 shows the comparison of different methods of reasoning with LM prompting. There are three main sources of prompts for existing methods: 1) Manual construction is suitable for template-based prompts and few-shot prompting where the prompt is uncomplicated. 2) LM-generated prompts make up for the shortcomings of manually constructed prompts. They can customize specific rationales for each question and provide sufficient knowledge with the prompt for fine-tuning or self-training. 3) Retrieval-based prompts often rely on well-annotated external resources (e.g., Wikipedia) and require expensive information retrieval, but they can alleviate the instability of generation.

We observe that no matter how the prompt is produced, CoT only works on large LMs under few-shot prompting. Combined with the empirical conclusion in Ye and Durrett (2022), these phenomena reveal that explicit high-quality reasoning rationales contained in the input context are the keys for reasoning with LM prompting. Although some works have attempted to explore the in-context learning ability on large LMs (Xie et al., 2022; Min et al., 2022; Akyürek et al., 2022), the reason why CoT prompting can succeed on large LMs is still intriguing to the community and not well understood. One possible hypothesis is that CoT is a magical side product of training on code that is unlocked by prompting. Note that exemplars containing CoT in few-shot prompts can be viewed as a kind of instruction that arouses the reasoning ability hidden in large LMs. Chung et al. (2022) verify a similar result using CoT in instruction fine-tuning to further advance model performance.

5 Benchmarks and Resources

5.1 Taxonomy of Benchmarks and Tasks

Researchers in the NLP community have released many benchmarks requiring various reasoning skills, including arithmetic reasoning, commonsense reasoning, logical reasoning, symbolic reasoning and multimodal reasoning. In this section, we give a brief overview of these reasoning benchmarks and tasks. More details of broader benchmarks, as well as reasoning with ChatGPT, can be found in Appendices C and D.

Arithmetic Reasoning.

Arithmetic reasoning, also referred to as mathematical reasoning, is the ability to perform reasoning on math word problems (MWP). Arithmetic reasoning skills are an important part of human intelligence and are also essential for general-purpose artificial intelligence systems. Early works on this task (Hosseini et al., 2014; Kushman et al., 2014; Roy et al., 2015; Koncel-Kedziorski et al., 2015; Roy and Roth, 2015; Ling et al., 2017b) focus on relatively small datasets consisting of grade-school single-step or multi-step MWPs, whose math operations cover +, −, ×, ÷. Later works increase in complexity and scale, and other datasets are proposed to increase the difficulty. Most recently, Mishra et al. (2022a) extend existing datasets to construct a unified benchmark concerning mathematical abilities, language format, language diversity and external knowledge.

Commonsense Reasoning.

Commonsense knowledge and commonsense reasoning are some of the major issues in machine intelligence (Storks et al., 2019; Bhargava and Ng, 2022). When answering a question, people often draw upon their rich world knowledge. For LMs, the major challenge of performing commonsense reasoning lies in how to involve physical and human interactions under the presumption of general background knowledge (Bhargava and Ng, 2022). Many benchmark datasets and tasks (Clark et al., 2018; Mihaylov et al., 2018; Talmor et al., 2019; Bisk et al., 2020; Geva et al., 2021) are designed to evaluate the ability of machines to learn commonsense knowledge in order to reason over natural language text. The most widely used benchmark today is CommonsenseQA (Talmor et al., 2019), which focuses on commonsense question answering, based on knowledge encoded in ConceptNet (Speer et al., 2017).

Logical Reasoning.

Common forms of logical reasoning include deductive reasoning and inductive reasoning. Deductive reasoning is performed by going from general information to specific conclusions; typical datasets in this field consist of synthetic rule bases plus derived conclusions (Clark et al., 2020; Tafjord et al., 2021). Recently, Dalvi et al. (2021) propose a dataset containing multi-step entailment trees, aiming to equip models with the ability to generate explanations showing the line of reasoning from what is known to the answer. As opposed to deductive reasoning, inductive reasoning aims to draw conclusions by going from the specific to the general. Sinha et al. (2019) construct a diagnostic benchmark that requires LMs both to extract relations between entities and to generate the logical rules.

論理的な推論の一般的な形態には、演繹的推論と帰納的推論があります。演繹的推論は、一般的な情報から特定の結論に至ることによって行われます。この分野の典型的なデータセットは、合成されたルールベースと派生した結論から構成されます(Clark et al., 2020; Tafjord et al., 2021)。最近、Dalvi et al. (2021)は、多段階の含意ツリーを含むデータセットを創造的に提案し、既知のものから答えまでの推論の流れを示す説明を生成する能力を備えたモデルを実現することを目指しています。演繹的推論とは対照的に、帰納的推論は、特定から一般へと結論を導くことを目指しています。Sinha et al. (2019)は、エンティティ間の関係を抽出する能力と論理ルールを生成する能力の両方が必要な診断ベンチマークを構築しています。
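
To make the "synthetic rule base plus derived conclusions" setup more concrete, the following is a minimal forward-chaining sketch. It is an illustration under assumptions, not the construction used by the cited datasets: the rule format (a set of premises and one conclusion), the fact strings, and the function name `forward_chain` are all hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule is (premises, conclusion): if every premise holds, the conclusion holds.
Rule = Tuple[FrozenSet[str], str]


def forward_chain(facts: Set[str], rules: List[Rule]) -> Tuple[Set[str], List[Rule]]:
    """Derive all conclusions reachable from the facts (deductive reasoning).

    Returns the closed fact set and the rules in the order they fired,
    which serves as a simple multi-step reasoning trace.
    """
    known = set(facts)
    trace: List[Rule] = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                trace.append((premises, conclusion))
                changed = True
    return known, trace


# Hypothetical rule base in the spirit of synthetic deduction datasets.
rules = [
    (frozenset({"is_animal", "is_alive"}), "needs_food"),
    (frozenset({"is_cat"}), "is_animal"),
]
derived, trace = forward_chain({"is_cat", "is_alive"}, rules)
```

The returned `trace` lists each fired rule in order, which loosely mirrors the idea of an explanation that leads from what is known to the answer.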

Symbolic Reasoning.

Symbolic reasoning here refers only to a narrow collection of simple tasks that test a diverse set of symbolic manipulation functions, rather than to symbolic AI, which is a more general concept implemented by rules engines, expert systems, or knowledge graphs. The construction of these tasks is usually well defined by humans; thus, it is easy to split the test set into in-domain and out-of-domain test sets. Typical symbolic reasoning tasks include last letter concatenation, reverse list and coin flip (Wei et al., 2022b).

ここでの記号的推論は、ルールエンジンやエキスパートシステム、知識グラフによって実装されるより一般的な概念である記号的AIではなく、多様な記号操作機能を試す簡単なタスクの狭い集合を指します。これらのタスクの構築は通常、人間によって明確に定義されているため、テストセットをインドメインのテストセットとアウトオブドメインのテストセットに簡単に分割することができます。典型的な記号的推論タスクには、最後の文字の連結、リストの反転、コインフリップ（Wei et al., 2022b）が含まれます。
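
Because these tasks are fully rule-defined, their ground-truth answers can be computed by short programs. The sketch below gives one possible reference implementation of the three tasks named above; the function names and sample inputs are hypothetical and are not taken from Wei et al. (2022b). It also shows how an in-domain versus out-of-domain split might be made by input length, assuming that "out-of-domain" means more steps than the prompt exemplars contain.

```python
from typing import List, Sequence


def last_letter_concatenation(name: str) -> str:
    """Concatenate the last letter of each word, e.g. a two-word name."""
    return "".join(word[-1] for word in name.split())


def reverse_list(items: Sequence[str]) -> List[str]:
    """Return the items in reverse order."""
    return list(reversed(items))


def coin_still_heads_up(flips: Sequence[bool]) -> bool:
    """A coin starts heads up; each True entry means one person flips it.

    The coin is still heads up if and only if it was flipped an even
    number of times.
    """
    return sum(flips) % 2 == 0


def is_in_domain(name: str, max_exemplar_words: int = 2) -> bool:
    """Assumed split rule: inputs no longer than the exemplars are in-domain."""
    return len(name.split()) <= max_exemplar_words


in_domain_answer = last_letter_concatenation("Amy Brown")        # two words
out_of_domain_answer = last_letter_concatenation("Amy Lee Brown Smith")
```

Since the answer function is the same for any input length, longer inputs give a clean test of whether a model generalizes the procedure rather than memorizing exemplar-sized cases.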

Multimodal Reasoning.

Most existing benchmarks for reasoning are restricted to the text-only modality and have limited domain diversity. However, humans utilize the information available across different modalities when performing reasoning. To this end, multimodal reasoning benchmarks are presented to narrow this gap. Zellers et al. (2019) seek to answer cognition-level questions from images, and Park et al. (2020) check how well PLMs reason about the dynamic context from a static image and an event. Recently, Lu et al. (2022a) present ScienceQA, a large-scale multimodal multiple-choice dataset that consists of diverse questions on science topics with corresponding answers and explanations. Zhang et al. (2022a) propose the new task of multimodal analogical reasoning over knowledge graphs, which requires multimodal reasoning ability with the help of background knowledge.

Apart from the above-mentioned specific reasoning tasks, there are some benchmarks (Lake and Baroni, 2017; Srivastava et al., 2022) that can evaluate a model's more diverse and generalized reasoning capabilities, which can also be included in the category of reasoning tasks. Most recently, Yu et al. (2022) introduce ALERT, a benchmark that spans over 20 datasets and covers 10 different reasoning skills, to assess different LMs on fine-grained reasoning skills.

現存する推論のベンチマークのほとんどは、テキストのみのモダリティに制限され、ドメインの多様性も限定的です。しかし、人間は異なるモダリティ間で利用可能な情報を活用して推論を行います。このため、このギャップを埋めるために、マルチモーダル推論のベンチマークが提案されています。Zellersら（2019）は画像から認知レベルの問題に答えることを試み、Parkら（2020）は静止画とイベントからの動的コンテキストについてPLMがどの程度推論するかを確認します。最近では、Luら（2022a）は、ScienceQAという大規模なマルチモーダル選択肢データセットを提案しており、科学トピックの多様な質問とそれに対応する回答や説明が含まれています。Zhangら（2022a）は、知識グラフ上でのマルチモーダル類推推論という新しいタスクを提案し、背景知識の助けを借りてマルチモーダル推論能力が必要とされます。

上記で述べた特定の推論タスクとは別に、いくつかのベンチマーク（Lake and Baroni, 2017; Srivastava et al., 2022）では、モデルのより多様で一般化された推論能力を評価することができ、推論タスクのカテゴリに含めることができます。最近では、Yuら（2022）がALERTというベンチマークを紹介しており、20以上のデータセットにまたがり、10種類の異なる推論スキルを網羅しており、細かい推論スキルで異なるLMを評価することができます。